1. Introduction

Zillow, one of the leading online real estate marketplaces in the United States, aims to improve the accuracy of its housing market predictions (the Zestimate®) so as to empower consumers with data and knowledge for decision-making. The main purpose of this project is to help Zillow build a predictive model of home prices in Nashville that is more robust, accurate, and generalizable.

Predicting the home sale price is the starting point for determining a home’s value. Because it draws on both public and user-submitted data, the prediction resembles a community-participatory data mining process. It is indispensable because it offers valuable information about the housing market and real estate development in a given area. It also ties closely to public concerns about housing affordability, the cost of living, and the stability of the real estate market.

The difficulties of this project lie partially in the limited availability of data for Nashville. In addition, so many special features and local conditions might potentially contribute to variation in home sale value in a specific area that omission seems inevitable, and identifying the best predictors becomes complicated. More importantly, it is never easy to adjust the model without overfitting, or to strike a balance between accuracy and generalizability.

Our overall modelling strategy is first to collect, by category, as many feature variables as possible, so that we account for the different conditions and aspects that may be correlated with home sale price. With all the variables gathered, we build linear regression models and, based on their results, make adjustments to refine the model. Additional analyses, including cross-validation and a Moran’s I test, then ensure that our final model is comparatively accurate and generalizable.

In total we selected nine variables as predictors in our final model: building condition, building type, built year, number of rooms, square footage of finished area, average distance to the nearest 10 police incidents, distance to the nearest top-20 elementary school, the spatial lag of home sale price, and the Assessor’s neighborhood code. Our final model accounts for approximately 68 percent of the variance in home sale price, with an average percent error of around 34%.

2. Data

2.1 Data dictionary

Theoretically, there are four main components of house price:

  • Internal characteristics (e.g. area, number of rooms)
  • Access to amenities (e.g. distance to bus stops, crime incidents, and parks)
  • Social environment (e.g. census median household income, census racial composition)
  • Spatial autocorrelation (e.g. the prices of comparable houses nearby)

Based on this theory, we gathered the following 29 variables. For calculation purposes, some of the original variables were recoded and transformed. We entered them into the OLS regression model one by one; the final model contains only the significant variables, which are highlighted in the list.

Dependent variable

“Saleprice” - Home sale price (log-transformed for calculation purposes)

Independent variables
- Internal characteristics:

“Neighborho” - The neighborhood code (NBC) used by the Assessor’s office (Categorical)
“LandUseFul” - Type of land unit (Categorical)
“Building_T2” - Type of building (Categorical)
“Exterior_W” - Exterior wall type (Categorical)
“roomsunits” - Number of rooms in the building (Numeric)
“effyearbuilt_building2” - The property built year (Categorical)
“bedroomsun” - Number of bedrooms (Numeric)
“baths” - Number of baths (Numeric)
“ac_sfyi” - Whether the property has central air (Dummy)
“sf_finished” - Square footage of finished area (Numeric)
“Phys_Depre2” - Building condition (Categorical)

- Access to amenity:

“Incident1320ft” - the number of incidents within each property’s 1320ft buffer (Numeric)
“dis_10_crime” - Average distance to the property’s 10 nearest incidents (Numeric)
“Distance_to_capitol” - Distance to the TN State Capitol (Numeric)
“Stops” - The number of bus stops within each property’s 1320ft buffer (Numeric)
“inside_or_not” - Whether the property is inside any bus stop’s 1320ft buffer (Dummy)
“d_RAIL_1” - The distance to the nearest railway (Numeric)
“Rail_ft2000” - Whether the property is inside any railway’s 2000ft buffer (Dummy)
“d_PARK_1” - The distance to the nearest park (Numeric)
“d_PARK_3” - The average distance to the 3 nearest parks (Numeric)
“d_eleschool20_1” - The distance to the nearest top 20 elementary school (Numeric)

- Social environment at block group level:

“POP_DEN_sqmi” - The population density per sq. mile (Numeric)
“MedianAge” - The median age (Numeric)
“AboveBach” - The percentage of residents with a degree above the bachelor (Numeric)
“MeHHInc” - Median Household Income (Numeric)
“VacancyR” - Vacancy rate (Numeric)
“BlackR” - The percentage of Black or African American (Numeric)

- Spatial structure of sale price:

“LAG_SP_5” - The average price of the nearest 5 properties (Numeric)
“LAG_SP10” - The average price of the nearest 10 properties (Numeric)
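As an illustration of how a spatial-lag variable of this kind can be built, the following Python sketch averages the prices of each property’s k nearest neighbors. The coordinates and prices below are made up for illustration; this is not the actual pipeline used to compute LAG_SP_5 and LAG_SP10.

```python
import math

def spatial_lag(points, prices, k):
    """For each point, average the prices of its k nearest other points."""
    lags = []
    for i, (xi, yi) in enumerate(points):
        # distance from property i to every other property
        dists = [(math.hypot(xi - xj, yi - yj), j)
                 for j, (xj, yj) in enumerate(points) if j != i]
        nearest = sorted(dists)[:k]
        lags.append(sum(prices[j] for _, j in nearest) / k)
    return lags

# toy example: four properties along a line
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 400]
print(spatial_lag(points, prices, 2))  # [250.0, 200.0, 150.0, 250.0]
```

In practice a k-d tree or GIS nearest-neighbor tool replaces the brute-force distance loop for thousands of parcels.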

All data come from four sources.

For some categorical variables, in order to make sure that there are sufficient observations in each category, we recode them into fewer categories:

  • In “Building_T2”, we recoded modular home, duplex, and zero lot line as “0”, and condos and single family as “1”.
  • In “Phys_Depre2”, we recoded Average building condition as “1”; Dilapidated, Poor, and Very Poor as “2”; and Excellent, Very Good, and Good as “3”.
  • In “effyearbuilt_building2”, we recoded the built year as “0” (before 1960), “1” (1960-1970), “2” (1970-1980), “3” (1980-1990), “4” (1990-2000), “5” (2000-2008), and “6” (2008-2018).
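The built-year recoding can be sketched as a simple binning function. The bin edges follow the report’s categories; how exact boundary years (e.g. 1970) are assigned is our assumption, not stated in the original.

```python
def recode_built_year(year):
    """Bin a built year into the report's 7 categories:
    0 = before 1960, ..., 6 = 2008-2018.
    Boundary years fall into the later bin (an assumption)."""
    cutoffs = [1960, 1970, 1980, 1990, 2000, 2008]
    for code, cutoff in enumerate(cutoffs):
        if year < cutoff:
            return code
    return 6

print([recode_built_year(y) for y in [1950, 1965, 1995, 2010]])  # [0, 1, 4, 6]
```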

2.2 Summary statistics

In the summary table, we only summarize the statistically significant variables that are retained in the final model.

##
## Summary Statistics for Variables
## ============================================================================================
## Statistic                N      Mean      St. Dev.     Min   Pctl(25)   Pctl(75)     Max
## --------------------------------------------------------------------------------------------
## SalePrice              9,001 322,473.400 336,834.100  1,562   148,400   389,000   6,894,305
## LocationZi             9,001 37,178.710    69.116    37,013   37,204     37,212     37,221
## WGS1984X               9,001   -86.753      0.075    -86.924  -86.809   -86.709    -86.574
## WGS1984Y               9,001   36.138       0.060    36.025   36.089     36.184     36.281
## test                   9,001    0.000       0.000       0        0         0          0
## sf_finished            9,001  1,723.675    969.051      0      1,156     2,161      10,608
## roomsunits             9,001    5.533       2.234       0        5         7          19
## Phys_Depre2            8,416    1.032       0.212     1.000    1.000     1.000      3.000
## effyearbuilt_building2 9,001    3.519       2.235       0        2         6          6
## Building_T2            8,378    0.974       0.159     0.000    1.000     1.000      1.000
## Neighborho             9,001  4,046.422   1,718.968    107     3,179     4,830      9,336
## dis_10_crime           9,001  3,734.625   2,171.216    33      2,249     4,705      17,406
## d_eleschool20_1        9,001  9,934.991   6,839.600  277.663 5,026.239 13,662.850 35,342.080
## LAG_SP_5               9,001 290,033.700 257,085.000 17,800   142,580   352,440   5,714,444
## --------------------------------------------------------------------------------------------

2.3 Correlation matrix

The correlation matrix allows us to check the independence of the predictors. In an OLS regression model, the predictors should not be strongly correlated with each other: if a correlation is greater than 0.8, one predictor interferes with another and the coefficient estimates become unreliable. The following correlation matrix indicates that no severe correlation exists among our variables, so all of them can be entered into the regression model.

2.4 Map of the dependent variable


2.5 Maps of three interesting independent variables

In this part, we plot three independent variables:

  • Each property’s distance to its nearest top-20 elementary school: we assume school quality matters to the sale price; the closer a property is to a good school, the higher its expected price.
  • Each property’s average distance to its nearest 10 crime incidents: we assume safety matters to the sale price; the farther a property is from criminal incidents, the higher its expected price.
  • The average sale price of each property’s 5 nearest neighbours: we assume house prices are spatially correlated, so the price of a property is influenced by its neighbors.


2.6 Maps of other interesting variables

In this part, we do the following three things:

  • Plot the numeric variables against the sale price, so as to roughly observe the potential relationships between them.
  • Make boxplots of sale price for the categorical variables, so as to observe the price differences between categorical levels.
  • Make histograms of selected variables to observe their distributions, then log-transform the non-normally distributed ones.

2.6.1 Relationship between numeric variables and Sale Price

We are especially interested in the relationship between each predictor and our dependent variable. Hence, we made several scatterplots to visualize the relationship.

As can be seen from the plots, the range of home sale prices narrows as the distance to a good elementary school increases or as the building age goes up, while it widens as the number of rooms increases. Home sale price appears positively associated with the average distance to crime incidents, the square footage of finished area, and the average price of the 5 nearest neighbors.

Note: To better visualize the relationship between the built year and the sale price, here we use building age instead of effyearbuilt_building2.


2.6.2 Dummy variables

We then made boxplots of sale price for the different categorical levels.

The first categorical variable is the building’s physical condition. As shown, buildings in good condition generally have relatively high sale prices, buildings in average condition have medium sale prices, and buildings in poor condition have low sale prices. The second categorical variable is the building type: residential condos and single-family houses tend to have relatively higher sale prices.


2.6.3 Before and after log transformation of four variables

As the histograms show, the original distributions of the sale price and three independent variables - the average distance to crime incidents, the square footage of finished area, and the average price of the 5 nearest neighbors - are all skewed. After log transformation they become much more normally distributed (for sf_finished and LAG_SP_5 we take log(x+1) because the original dataset contains zero values).
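The log(x+1) handling can be illustrated with a small Python sketch on made-up, right-skewed prices; the gap between mean and median shrinks sharply after the transform.

```python
import math
from statistics import mean, median

def log1p_series(xs):
    """Apply log(x + 1), which stays defined at zero - mirroring the
    report's handling of sf_finished and LAG_SP_5."""
    return [math.log(x + 1) for x in xs]

# toy right-skewed prices: one extreme value pulls the mean far above the median
prices = [100, 120, 150, 200, 300, 5000]
logged = log1p_series(prices)
print(mean(prices) - median(prices))   # large gap: strong right skew
print(mean(logged) - median(logged))   # small gap: skew greatly reduced
```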

3. Method

After gathering enough data that might be correlated with home sale price, we start with some exploratory analyses. We then test the potential statistical relationships between each feature variable and home sale price across the city. The entire dataset is divided into two subsets, one for training (75%) and one for testing (25%). We use the training set to build regression models and examine the outputs. Based on the coefficient results, some variables are removed from the original model and others are added. After several such iterations, we arrive at a model in which every predictor is statistically significant. We then evaluate the model on the test set, using indices such as MAE and MAPE to judge how well it performs. Additional analyses, including cross-validation and a Moran’s I test, are also necessary to make sure the regression assumptions are not violated and that our final model is relatively accurate and generalizable.
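The 75/25 random split can be sketched as follows; this is an illustrative Python stand-in for the split performed in R, with a fixed seed for reproducibility.

```python
import random

def train_test_split(rows, test_frac=0.25, seed=42):
    """Randomly hold out test_frac of the rows, as in the 75/25 split."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * test_frac)
    test_idx = set(idx[:cut])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 75 25
```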


4. Result

4.1 Out-of-sample Prediction Model Results Summary

We randomly select 75% data from the original dataset as the training data, with the remaining 25% serving as the test data. We build the model based on the training data, and test it on the unseen test data. Below is the table of the regression results of the model built on training set.

As shown in the table, the adjusted R-squared is 0.6549, meaning 65.49% of the variance in the sale price can be explained by the model. Each coefficient indicates the direction and magnitude of the change in the dependent variable associated with a one-unit change in the predictor. If the p-value is smaller than 0.05, the predictor is statistically significant in the model.

Adjusted R-squared: 0.6549; Multiple R-squared: 0.6828
F-statistic: 24.5 on 523 and 5952 DF, p-value: < 2.2e-16
Residual standard error: 0.4361 on 5952 degrees of freedom
Variables Estimate Std..Error t_value P_value
(Intercept) 7.5029 0.3870 19.407 < 2e-16 ***
dis_10_crime 0.0364 0.0129 2.826 0.004730 **
d_eleschool20_1 -0.0001 0.0000 -2.030 0.042385 *
effyearbuilt_building21 0.0618 0.0300 2.063 0.039112 *
effyearbuilt_building22 0.0576 0.0311 1.851 0.064258 .
effyearbuilt_building23 0.0890 0.0275 3.230 0.001244 **
effyearbuilt_building24 0.0886 0.0262 3.379 0.000733 ***
effyearbuilt_building25 0.1220 0.0292 4.180 2.96e-05 ***
effyearbuilt_building26 0.0447 0.0268 1.668 0.095418 .
Building_T21 0.3140 0.0443 7.083 1.58e-12 ***
roomsunits 0.0240 0.0061 3.917 9.07e-05 ***
sf_finished 0.4530 0.0272 16.616 < 2e-16 ***
Phys_Depre22 -0.1540 0.0454 -3.386 0.000715 ***
Phys_Depre23 0.1750 0.0714 2.454 0.014151 *
LAG_SP_5 0.1150 0.0173 6.647 .26e-11 ***
Neighborho126 -0.3360 0.2580 -1.301 0.1934
Neighborho226 -0.9930 0.2600 -3.825 0.000132 ***
Neighborho306 -0.2710 0.5050 -0.537 0.5913
Neighborho326 -1.1900 0.2920 -4.061 4.95e-05 ***
Neighborho726 -1.0300 0.5050 -2.046 0.040755 *
Neighborho1011 0.1020 0.4010 0.253 0.8
Note:
Since the full table is very long, we only attach part of it here. The complete table can be accessed via: https://drive.google.com/open?id=17G7LvULk22jwraDmCadUlZD_f30IEUzf

4.2 Test set Result Summary

The model built with the training data is then used to predict home sale prices for the test data, to see how it performs on a new, randomly selected dataset. If we can predict the test set well, the model is generalizable. The outcomes are summarized below.

Dataset       MAE     MAPE    R_squared
Training set  84065   0.352   0.683
Test set      93104   0.349   0.665

Mean absolute error (MAE) and mean absolute percent error (MAPE) are frequently used to measure the differences between predicted and observed values. Generally, the smaller the MAE and MAPE, the better the prediction. Similar MAE and MAPE between the training data and the test data indicate that the model is generalizable.

The MAE for the training data is 84065, and that of the test data is 93104.
The mean absolute percent error (MAPE) for the training and test data are both around 0.35.
The R-squared for the training data is 0.68, meaning approximately 68 percent of the variance in home sale price in the training set is explained by the model. The R-squared for the test set is close, indicating that the model accounts for approximately 66 percent of the variance in home sale price there.
This comparison indicates that the model performs similarly on known and unknown data, and is thus generalizable.
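The two error measures used above can be sketched in a few lines of Python; the observed and predicted prices here are made up to show the arithmetic, not taken from the Nashville data.

```python
def mae(observed, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

def mape(observed, predicted):
    """Mean absolute percent error: average error relative to each observation."""
    return sum(abs(p - o) / o for o, p in zip(observed, predicted)) / len(observed)

obs = [200000, 300000, 400000]
pred = [220000, 270000, 410000]
print(mae(obs, pred))               # 20000.0
print(round(mape(obs, pred), 3))    # 0.075
```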

4.3 Cross Validation

To make sure that the previous out-of-sample test was not simply a fortunate split, we run a 100-fold cross-validation to test the performance on 100 unseen data samples and see whether the model is generalizable.

As shown, the average root-mean-square error (RMSE) is around 0.44. RMSE is a frequently used measure of the differences between the values predicted by a model and the observed values; the smaller the RMSE, the better the prediction.

The average R-squared is around 0.65, meaning that on average around 65% of the variance in the sale price can be explained by the model. The distribution of R-squared values can be observed in the following histogram. The majority of the values fall between 0.6 and 0.75, but a considerable number are spread as low as 0.4 and as high as 0.8, which implies that our model could be further improved.

The average MAE is around 0.3. Since the dependent variable has been logged in the model, here the MAE is the absolute difference between the logged observed price and the logged predicted price.
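The resampling logic behind the caret output below can be sketched in Python; this simplified stand-in uses a trivial mean-only "model" on made-up data, purely to show how each fold is held out, scored, and averaged.

```python
import random
from statistics import mean

def k_fold_mae(y, k=5, seed=1):
    """Hold out each fold once, 'fit' a mean-only model on the rest,
    and average the out-of-fold MAE - the same resampling pattern
    caret applies to the full regression."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint folds
    scores = []
    for fold in folds:
        held = set(fold)
        fit = mean(y[i] for i in idx if i not in held)   # training-set mean
        scores.append(mean(abs(y[i] - fit) for i in fold))
    return mean(scores)

y = [float(v) for v in range(1, 101)]
print(round(k_fold_mae(y), 2))
```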

## Linear Regression
##
## 6476 samples
##    9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 6412, 6412, 6411, 6411, 6411, 6412, ...
## Resampling results:
##
##   RMSE       Rsquared   MAE
##   0.4392914  0.6535187  0.2983308
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##   Mean.of.Rsquared Std.of.RSquared
## 1        0.6535187      0.08515152


4.4 Other Visualization

If our model predicts well, the predicted sale prices should be close to the corresponding observed sale prices. The plot below shows the predicted sale price as a function of the observed sale price. The gap between predicted and observed values indicates that our model has room for improvement.

 

Next, we test for spatial autocorrelation. First we plot the residuals on the map. The residuals appear randomly distributed across the entire city, without spatial clusters of large or small residuals.

We then run Moran’s I test on the residuals. The Moran’s I statistic is -0.045, quite close to zero, indicating only very slight spatial autocorrelation in the residuals. Moreover, the p-value is 0.9992, far greater than 0.05, so the spatial autocorrelation is statistically insignificant. The residuals can therefore be considered randomly distributed, and the model does not violate the regression assumption of residual independence.

##
##  Moran I test under randomisation
##
## data:  regResiduals_test$residuals
## weights: nb2listw(spatialWeights, style = "W")
##
## Moran I statistic standard deviate = -3.1469, p-value = 0.9992
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance
##     -0.0456202686     -0.0005260389      0.0002053445
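For intuition, global Moran’s I can be computed by hand on a tiny example; this Python sketch mirrors the statistic reported above (the R analysis uses spdep with row-standardized weights, whereas the toy weights here are unstandardized).

```python
def morans_i(values, weights):
    """Global Moran's I for a value list and a symmetric
    neighbor-weight matrix (weights[i][j] > 0 for neighbors)."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# four points on a line with adjacent-neighbor weights; perfectly
# alternating values give the strongest negative autocorrelation
vals = [1, -1, 1, -1]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(round(morans_i(vals, W), 6))  # -1.0
```

Values near zero, like the -0.045 above, indicate no meaningful spatial pattern in the residuals.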

 

Finally, we plot the predicted prices for the entire dataset. Their distribution is similar to that of the observed sale prices: high sale prices tend to cluster in the south, especially southwest Nashville, while low sale prices tend to cluster in the northwest.

4.5 Summary by zip code districts

We then look at the variation in mean absolute percentage error (MAPE) by zip code, which reflects differences in model accuracy across zip code districts. As can be observed from the map below, zip code districts with higher sale prices generally have lower MAPE, which is preferred, while districts with lower sale prices tend to have larger MAPE. This variance in MAPE across districts suggests that our model still needs to become more generalizable: it currently performs better in high-price districts than in low-price ones, whereas it should ideally perform similarly across regions.

The relationship between the model’s performance and the area it predicts for is also displayed in the plot below. As the mean price goes up, the MAPE goes down, meaning that our model predicts better in areas with higher mean home sale prices.

4.6 Spatial Cross Validation

Here we run three separate models, each time removing from the training set one district with relatively high, medium, or low home sale prices, and use the model built on the remaining neighborhoods to predict the held-out one. The mean absolute percentage error (MAPE) tells us how well our model predicts each type of district. The outcome corresponds to the map in Part 4.5 (MAPE by zip code district), with the lowest MAPE in the rich district and relatively high MAPE in the median- and low-income districts. The model therefore performs best in rich districts and less well in poor districts; in the Discussion, we analyze potential reasons for this discrepancy.

(The code for this part is included below.)

#Extra Credit
#Zipcode for the hold out neighborhood - rich: 37220; Median: 37211 ; Low: 37217

#Remove the rich: 37220
i <- which(Summaryvariables$LocationZi %in% c("37220"))
rich_train <- Summaryvariables[-i,]
rich_test <- Summaryvariables[i,]
rich_reg <- lm(log(SalePrice) ~Neighborho+dis_10_crime+d_eleschool20_1+effyearbuilt_building2+
                 Building_T2+roomsunits+sf_finished+Phys_Depre2+LAG_SP_5,data=rich_train)

newfactor1 <- which(!(rich_test$Neighborho %in% (rich_train$Neighborho)))
rich_test$Neighborho[newfactor1] <- NA

rich_residual <- data.frame(observed=rich_test$SalePrice,
                            predicted=exp(predict(rich_reg,rich_test))) %>%
  transmute(APE=abs(predicted-observed)/observed,
            AE=abs(predicted-observed),
            observed=rich_test$SalePrice,
            predicted=exp(predict(rich_reg,rich_test)),
            neighborhood="rich")%>%
  na.omit()
mean(rich_residual$APE)
## [1] 0.2054991
mean(rich_residual$AE)
## [1] 94151.06
#Remove the low income: 37217
i <- which(Summaryvariables$LocationZi %in% c("37217"))
low_train <- Summaryvariables[-i,]
low_test <- Summaryvariables[i,]
low_reg <- lm(log(SalePrice) ~Neighborho+dis_10_crime+d_eleschool20_1+effyearbuilt_building2+
                 Building_T2+roomsunits+sf_finished+Phys_Depre2+LAG_SP_5,data=low_train)

newfactor1 <- which(!(low_test$Neighborho %in% (low_train$Neighborho)))
low_test$Neighborho[newfactor1] <- NA

low_residual <- data.frame(observed=low_test$SalePrice,
                            predicted=exp(predict(low_reg,low_test))) %>%
  transmute(APE=abs(predicted-observed)/observed,
            AE=abs(predicted-observed),
            observed=low_test$SalePrice,
            predicted=exp(predict(low_reg,low_test)),
            neighborhood="low")%>%
  na.omit()
mean(low_residual$APE)
## [1] 0.3224945
mean(low_residual$AE)
## [1] 34531.47
#Remove the median income: 37211
i <- which(Summaryvariables$LocationZi %in% c("37211"))
median_train <- Summaryvariables[-i,]
median_test <- Summaryvariables[i,]
median_reg <- lm(log(SalePrice) ~Neighborho+dis_10_crime+d_eleschool20_1+effyearbuilt_building2+
                   Building_T2+roomsunits+sf_finished+Phys_Depre2+LAG_SP_5,data=median_train)

newfactor1 <- which(!(median_test$Neighborho %in% (median_train$Neighborho)))
median_test$Neighborho[newfactor1] <- NA

median_residual <- data.frame(observed=median_test$SalePrice,
                              predicted=exp(predict(median_reg,median_test))) %>%
  transmute(APE=abs(predicted-observed)/observed,
            AE=abs(predicted-observed),
            observed=median_test$SalePrice,
            predicted=exp(predict(median_reg,median_test)),
            neighborhood="median")%>%
  na.omit()
mean(median_residual$APE)
## [1] 0.3508915
mean(median_residual$AE)
## [1] 62497.67
data.frame(MAPE_rich=mean(rich_residual$APE),
           MAPE_medium=mean(median_residual$APE),
           MAPE_poor=mean(low_residual$APE))
##   MAPE_rich MAPE_medium MAPE_poor
## 1 0.2054991   0.3508915 0.3224945
data.frame(MAE_rich=mean(rich_residual$AE),
           MAE_medium=mean(median_residual$AE),
           MAE_poor=mean(low_residual$AE))
##   MAE_rich MAE_medium MAE_poor
## 1 94151.06   62497.67 34531.47
District  MAE     MAPE
Rich      94151   0.21
Medium    62498   0.35
Poor      34531   0.32

5. Discussion

Approximately 65.5 percent of the variance in the dependent variable (home sale price) in the training data can be explained by our model. On unseen data (the test set), the model also performs well, accounting for around 65 percent of the variance.

In our final model, some predictors show more significance than others. First, among the internal characteristics of a property, the square footage of finished area has an important effect and is positively associated with the home sale price; building condition and building type also have significant impacts. Second, for access to amenities, only two significant predictors remain after removing the insignificant ones: the distance to the nearest top-20 elementary school, negatively correlated with the home sale price, and the average distance to the nearest 10 criminal incidents, positively correlated with it. Finally, the average sale price of the 5 nearest homes and the neighborhood fixed effect are two further significant predictors, which we deem the most important for capturing spatial autocorrelation.

The model’s MAPE is around 35% when predicting unseen home sale prices, which is not very small but still acceptable. From the map we can observe the spatial variation in the observed sale prices: a large portion of the high sale prices cluster in the southwestern part of Nashville, with another small portion in the northeast. Since the residuals’ Moran’s I is very close to zero and insignificant, the residuals are randomly distributed and the model accounts well for the spatial pattern of sale prices.

As we can see from the map of MAPE by zip code district, MAPE is relatively lower in districts with high sale prices. In districts with lower home sale prices, especially those near the city center, our model does not perform as well. We suspect the reason is that, near the city center, proximity to commercial activities and other related assets and facilities, which our final model does not take into account, has a non-negligible effect on home sale prices.

6. Conclusion

To conclude, the final model is comparatively robust and generalizable at a citywide level. Because all the data used in this model are open source and accessible to all, and the regression results indicate that the model does well in both accuracy and generalizability, we would recommend it to Zillow for its housing market predictions in the city of Nashville.

Nevertheless, our model still has some limitations. First, as mentioned in the Discussion, we failed to identify a significant predictor that reflects commercial capacity and the underlying spatial relationships at the city level. Moreover, because we included the neighborhood code as a fixed effect, the prediction is tailored to the city of Nashville: though generalizable across the entire Nashville area, it cannot be directly applied to other cities and regions.

For future improvement, we might explore further to identify a set of predictor variables that is more comprehensive and applicable to different cities and regions across the country.